The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a very historic moment, both scientifically and analytically (we have a large amount of data being shared across the globe and analyzed in real time). The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data… but this was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy - you clicked on the U.S. stats but it didn’t reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate in things like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Sat Jun 20 15:14:18 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ itself but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Sat Jun 20 15:14:18 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  # model parameters are on the log10 scale: log10(cases) = m*days + b
  total_cases.log<-m*days+b
  total_cases<-10^total_cases.log  # back-transform to case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
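As a sanity check, the log-linear relationship the function encodes can be evaluated directly. The slope and intercept below are illustrative placeholders, not fitted values from any region:

```r
# Log-linear growth: log10(cases) = m * days + b, so cases = 10^(m * days + b).
# m and b here are made-up, illustrative values, not fitted to real data.
m <- 0.1   # slope on the log10 scale (~26% daily growth, since 10^0.1 ~ 1.26)
b <- 2     # intercept: 10^2 = 100 cases on day 0 (the "100th case" anchor)
days <- 0:3
cases <- 10^(m * days + b)
round(cases)  # 100 126 158 200
```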

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/values. Also enforces standardization/assumes data structure naming by using fixed variable name, value name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count second derivatives within 0.2 of zero over a trailing window (window = running_avg)
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-running_avg+1),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
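The detection idea behind `find_linear_index` can be seen on synthetic data: once log-scale growth turns linear, second differences of the log series drop to ~0. This toy sketch uses plain `diff` rather than the function's windowed count, and made-up numbers rather than real case counts:

```r
# Synthetic log2 case counts: quadratic growth for 10 days, then linear
log2_cases <- c((1:10)^2 / 5, 20 + 0.5 * (1:10))
d1 <- diff(log2_cases)  # first derivative: daily log-growth
d2 <- diff(d1)          # second derivative: change in growth rate
# Growth has stabilized wherever |second derivative| is within the 0.2 band
which(abs(d2) < 0.2)    # 10 11 12 13 14 15 16 17 18
```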

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?
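One hedged way to answer the archiving question without git: rename the existing file with its modification date before each fresh download. The helper name and naming scheme here are hypothetical, not part of the current pipeline:

```r
# Hypothetical helper: stamp the previous download with its modification date
# before it gets overwritten, e.g. data/file.csv -> data/20200619_file.csv
archive_previous <- function(fn) {
  if (file.exists(fn)) {
    stamp <- format(file.info(fn)$mtime, "%Y%m%d")
    file.rename(fn, file.path(dirname(fn), paste0(stamp, "_", basename(fn))))
  }
}
# Usage (before each download.file call): archive_previous(Corona_Cases.fn)
```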

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.19.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.19.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.19.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.19.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# Population is listed in the US death file but not in the other datasets... Weird. When combining, all datasets will have this column, but US deaths is the only useful one.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format. It's in wide format with each column a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis
# Furthermore, the World and US level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # R prefixes the date column names with "X"; swap in a leading 0
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # year ends in .20, not .2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 150 150
Albania 150 150
Algeria 150 150
Andorra 150 150
Angola 150 150
Antigua and Barbuda 150 150
Argentina 150 150
Armenia 150 150
Australia 1200 1200
Austria 150 150
Azerbaijan 150 150
Bahamas 150 150
Bahrain 150 150
Bangladesh 150 150
Barbados 150 150
Belarus 150 150
Belgium 150 150
Belize 150 150
Benin 150 150
Bhutan 150 150
Bolivia 150 150
Bosnia and Herzegovina 150 150
Botswana 150 150
Brazil 150 150
Brunei 150 150
Bulgaria 150 150
Burkina Faso 150 150
Burma 150 150
Burundi 150 150
Cabo Verde 150 150
Cambodia 150 150
Cameroon 150 150
Canada 2100 2100
Central African Republic 150 150
Chad 150 150
Chile 150 150
China 4950 4950
Colombia 150 150
Comoros 150 150
Congo (Brazzaville) 150 150
Congo (Kinshasa) 150 150
Costa Rica 150 150
Cote d’Ivoire 150 150
Croatia 150 150
Cuba 150 150
Cyprus 150 150
Czechia 150 150
Denmark 450 450
Diamond Princess 150 150
Djibouti 150 150
Dominica 150 150
Dominican Republic 150 150
Ecuador 150 150
Egypt 150 150
El Salvador 150 150
Equatorial Guinea 150 150
Eritrea 150 150
Estonia 150 150
Eswatini 150 150
Ethiopia 150 150
Fiji 150 150
Finland 150 150
France 1650 1650
Gabon 150 150
Gambia 150 150
Georgia 150 150
Germany 150 150
Ghana 150 150
Greece 150 150
Grenada 150 150
Guatemala 150 150
Guinea 150 150
Guinea-Bissau 150 150
Guyana 150 150
Haiti 150 150
Holy See 150 150
Honduras 150 150
Hungary 150 150
Iceland 150 150
India 150 150
Indonesia 150 150
Iran 150 150
Iraq 150 150
Ireland 150 150
Israel 150 150
Italy 150 150
Jamaica 150 150
Japan 150 150
Jordan 150 150
Kazakhstan 150 150
Kenya 150 150
Korea, South 150 150
Kosovo 150 150
Kuwait 150 150
Kyrgyzstan 150 150
Laos 150 150
Latvia 150 150
Lebanon 150 150
Lesotho 150 150
Liberia 150 150
Libya 150 150
Liechtenstein 150 150
Lithuania 150 150
Luxembourg 150 150
Madagascar 150 150
Malawi 150 150
Malaysia 150 150
Maldives 150 150
Mali 150 150
Malta 150 150
Mauritania 150 150
Mauritius 150 150
Mexico 150 150
Moldova 150 150
Monaco 150 150
Mongolia 150 150
Montenegro 150 150
Morocco 150 150
Mozambique 150 150
MS Zaandam 150 150
Namibia 150 150
Nepal 150 150
Netherlands 750 750
New Zealand 150 150
Nicaragua 150 150
Niger 150 150
Nigeria 150 150
North Macedonia 150 150
Norway 150 150
Oman 150 150
Pakistan 150 150
Panama 150 150
Papua New Guinea 150 150
Paraguay 150 150
Peru 150 150
Philippines 150 150
Poland 150 150
Portugal 150 150
Qatar 150 150
Romania 150 150
Russia 150 150
Rwanda 150 150
Saint Kitts and Nevis 150 150
Saint Lucia 150 150
Saint Vincent and the Grenadines 150 150
San Marino 150 150
Sao Tome and Principe 150 150
Saudi Arabia 150 150
Senegal 150 150
Serbia 150 150
Seychelles 150 150
Sierra Leone 150 150
Singapore 150 150
Slovakia 150 150
Slovenia 150 150
Somalia 150 150
South Africa 150 150
South Sudan 150 150
Spain 150 150
Sri Lanka 150 150
Sudan 150 150
Suriname 150 150
Sweden 150 150
Switzerland 150 150
Syria 150 150
Taiwan* 150 150
Tajikistan 150 150
Tanzania 150 150
Thailand 150 150
Timor-Leste 150 150
Togo 150 150
Trinidad and Tobago 150 150
Tunisia 150 150
Turkey 150 150
Uganda 150 150
Ukraine 150 150
United Arab Emirates 150 150
United Kingdom 1650 1650
Uruguay 150 150
US 150 150
US_state 489150 489150
Uzbekistan 150 150
Venezuela 150 150
Vietnam 150 150
West Bank and Gaza 150 150
Western Sahara 150 150
Yemen 150 150
Zambia 150 150
Zimbabwe 150 150
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized at the country level; collapse all but US states
Corona_Cases.long<-rbind.fill(
  ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),
        c("case_type","Country.Region","Date","Date.numeric"),
        summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),
  filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

# Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: Non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5893
Alaska 1198
Arizona 1436
Arkansas 6255
California 5472
Colorado 5327
Connecticut 853
Delaware 359
Diamond Princess 95
District of Columbia 96
Florida 6198
Georgia 14035
Grand Princess 96
Guam 96
Hawaii 492
Idaho 2864
Illinois 8278
Indiana 8125
Iowa 7780
Kansas 6604
Kentucky 9346
Louisiana 5840
Maine 1491
Maryland 2249
Massachusetts 1422
Michigan 7250
Minnesota 6913
Mississippi 7249
Missouri 8379
Montana 2570
Nebraska 5103
Nevada 1161
New Hampshire 1016
New Jersey 2147
New Mexico 2548
New York 5483
North Carolina 8636
North Dakota 3157
Northern Mariana Islands 81
Ohio 7677
Oklahoma 5973
Oregon 2957
Pennsylvania 6008
Puerto Rico 96
Rhode Island 568
South Carolina 4204
South Dakota 4003
Tennessee 8338
Texas 18189
Utah 1305
Vermont 1365
Virgin Islands 96
Virginia 10977
Washington 3736
West Virginia 4128
Wisconsin 5930
Wyoming 1849
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question                                   dataset  x      y               color      facet              pch        dimensions
comparative longitudinal case trend (log)  long     time   log_cases       geography  none (case type?)  case_type  [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend        long     time   cases           geography  case_type          ?          [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend   wide     time   mortality rate  geography  none               none       [15, 50, 4] geography
death vs total correlation                 wide     cases  deaths          geography  none               none       [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in case/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in case/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical regsion, logcase vs day (100th case)
    # output: m, b for each geographical region ID
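A minimal sketch of that per-region fit idea, using base R `split`/`lapply` on a toy frame. Column names mirror this analysis, but the data below are made up for illustration:

```r
# Toy long-format data: two regions with exact log-linear growth
toy <- data.frame(
  Country.Region = rep(c("A", "B"), each = 5),
  Days_since_100 = rep(0:4, 2),
  Total_confirmed_cases.log = c(2 + 0.30 * (0:4),   # region A: slope 0.30
                                2 + 0.15 * (0:4))   # region B: slope 0.15
)
# One lm per region; collect slope (m) and intercept (b) for each region ID
fits <- do.call(rbind, lapply(split(toy, toy$Country.Region), function(df) {
  fit <- lm(Total_confirmed_cases.log ~ Days_since_100, data = df)
  data.frame(Country.Region = df$Country.Region[1],
             m = unname(coef(fit)[2]), b = unname(coef(fit)[1]))
}))
fits  # region A: m = 0.30; region B: m = 0.15 (both b = 2)
```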



#total/death on the same plot - they differ by ~2 logs, so when plotting log scale, use pch; when plotting absolute counts, need to use free scales
#when plotting death and case on the same plot, melt first.

#CoronaCases - > filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 66376
China 150
Diamond Princess 131
Korea, South 121
Japan 120
Italy 118
Iran 115
Singapore 112
France 111
Germany 111
Spain 110
US 108
Switzerland 107
United Kingdom 107
Belgium 106
Netherlands 106
Norway 106
Sweden 106
Austria 104
Malaysia 103
Australia 102
Bahrain 102
Denmark 102
Canada 101
Qatar 101
Iceland 100
Brazil 99
Czechia 99
Finland 99
Greece 99
Iraq 99
Israel 99
Portugal 99
Slovenia 99
Egypt 98
Estonia 98
India 98
Ireland 98
Kuwait 98
Philippines 98
Poland 98
Romania 98
Saudi Arabia 98
Indonesia 97
Lebanon 97
Pakistan 97
San Marino 97
Thailand 97
Chile 96
Luxembourg 95
Peru 95
Russia 95
Ecuador 94
Mexico 94
Slovakia 94
South Africa 94
United Arab Emirates 94
Armenia 93
Colombia 93
Croatia 93
Panama 93
Serbia 93
Taiwan* 93
Turkey 93
Argentina 92
Bulgaria 92
Latvia 92
Uruguay 92
Algeria 91
Costa Rica 91
Dominican Republic 91
Hungary 91
Andorra 90
Bosnia and Herzegovina 90
Jordan 90
Lithuania 90
Morocco 90
New Zealand 90
North Macedonia 90
Vietnam 90
Albania 89
Cyprus 89
Malta 89
Moldova 89
Brunei 88
Burkina Faso 88
Sri Lanka 88
Tunisia 88
Ukraine 87
Azerbaijan 86
Ghana 86
Kazakhstan 86
Oman 86
Senegal 86
Venezuela 86
Afghanistan 85
Cote d’Ivoire 85
Cuba 84
Mauritius 84
Uzbekistan 84
Cambodia 83
Cameroon 83
Honduras 83
Nigeria 83
West Bank and Gaza 83
Belarus 82
Georgia 82
Bolivia 81
Kosovo 81
Kyrgyzstan 81
Montenegro 81
Congo (Kinshasa) 80
Kenya 79
Niger 78
Guinea 77
Rwanda 77
Trinidad and Tobago 77
Paraguay 76
Bangladesh 75
Djibouti 73
El Salvador 72
Guatemala 71
Madagascar 70
Mali 69
Congo (Brazzaville) 66
Jamaica 66
Gabon 64
Somalia 64
Tanzania 64
Ethiopia 63
Burma 62
Sudan 61
Liberia 60
Maldives 58
Equatorial Guinea 57
Cabo Verde 55
Sierra Leone 53
Guinea-Bissau 52
Togo 52
Zambia 51
Eswatini 50
Chad 49
Tajikistan 48
Haiti 46
Sao Tome and Principe 46
Benin 44
Nepal 44
Uganda 44
Central African Republic 43
South Sudan 43
Guyana 41
Mozambique 40
Yemen 36
Mongolia 35
Mauritania 32
Nicaragua 32
Malawi 26
Syria 26
Zimbabwe 24
Bahamas 23
Libya 23
Comoros 21
Suriname 13
Angola 10
Eritrea 5
Burundi 4
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this - China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 150
Korea, South 121
Japan 120
Italy 118
Iran 115
Singapore 112
France 111
Germany 111
Spain 110
US 108
Switzerland 107
United Kingdom 107
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-03-30        18351
## 2    Afghanistan           <NA> <NA> 2020-03-25        18346
## 3    Afghanistan           <NA> <NA> 2020-05-21        18403
## 4    Afghanistan           <NA> <NA> 2020-03-29        18350
## 5    Afghanistan           <NA> <NA> 2020-01-23        18284
## 6    Afghanistan           <NA> <NA> 2020-03-31        18352
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      4                   170     0.02352941
## 2                      2                    84     0.02380952
## 3                    193                  8676     0.02224527
## 4                      4                   120     0.03333333
## 5                      0                     0            NaN
## 6                      4                   174     0.02298851
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  2.230449                   0.602060        18348
## 2                  1.924279                   0.301030        18348
## 3                  3.938320                   2.285557        18348
## 4                  2.079181                   0.602060        18348
## 5                      -Inf                       -Inf        18348
## 6                  2.240549                   0.602060        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1              3  NA   NA         NA                           NA
## 2             -2  NA   NA         NA                           NA
## 3             55  NA   NA         NA                           NA
## 4              2  NA   NA         NA                           NA
## 5            -64  NA   NA         NA                           NA
## 6              4  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

  1. Find the time at which cases vs date becomes linear in each plot
  2. Fit a linear model for each city
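The helper `find_linear_index` used below is defined upstream; a minimal sketch of step 1, under the assumption that it returns the numeric date after which the second difference of total cases settles near zero:

```r
# Hypothetical sketch of find_linear_index(): for one state's time series,
# return the numeric date after which growth is roughly linear, i.e. where
# the discrete second difference of total cases is approximately zero.
find_linear_index_sketch <- function(df, tol = 0.05) {
  df <- df[order(df$Date), ]
  y <- df$Total_confirmed_cases_perstate
  d2 <- diff(y, differences = 2)          # discrete 2nd derivative
  # first index where curvature is small relative to the series scale
  flat <- which(abs(d2) < tol * max(y))
  idx <- if (length(flat) > 0) min(flat) + 1 else 1
  as.numeric(df$Date[idx])
}
```

The tolerance and the "first flat point" heuristic are assumptions for illustration; the actual upstream implementation may differ.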

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the state list: swap New York out for Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is approximately 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")
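The `prediction_model()` helper is defined upstream; a minimal sketch consistent with how it is called here, assuming it simply evaluates the fitted line for each requested day and returns the column names used above:

```r
# Hypothetical sketch of prediction_model(): evaluate the fitted linear
# model y = m * days + b at the requested numeric days, returning a data
# frame with the same columns as predicted_days_df.
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```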

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 cases based on the model thus far? Additional questions:

Why did it take until day 40 to start a log-linear trend? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, allowing for reporting delays?
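For the "how long until x cases" question, the log-linear fit can be inverted; a sketch, assuming slope m and intercept b come from the `Total_confirmed_cases.log~Days_since_100` model above:

```r
# Given a log-linear fit log10(cases) = m * days + b, solve for the day
# at which a target case count is reached: days = (log10(target) - b) / m
days_to_target <- function(m, b, target) {
  (log10(target) - b) / m
}
```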

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Toy data to sanity-check per-group model fitting: state A is y=x+5, state B is y=5x+1
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

# Fit a linear model within each subset of the data and return the coefficients
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  case_models.parameters<-plyr::rename(case_models.parameters,c("(Intercept)"="b"))
  names(case_models.parameters)[3]<-"m" # slope column is named after the predictor
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes % change in user mobility relative to baseline for: * Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9728199 -34.0
New Hampshire Parks 0.9501439 -20.0
Hawaii Parks 0.9427290 -72.0
Vermont Parks 0.9382721 -35.5
Hawaii Retail_Recreation 0.9343342 -56.0
Maine Transit -0.9171468 -50.0
Hawaii Transit 0.9121202 -89.0
Connecticut Grocery_Pharmacy -0.8831135 -6.0
Utah Residential -0.8689705 12.0
Utah Transit -0.8531298 -18.0
South Dakota Parks 0.8026296 -26.0
Arizona Grocery_Pharmacy -0.7778718 -15.0
Alaska Grocery_Pharmacy -0.7684680 -7.0
Rhode Island Workplace -0.7677330 -39.5
Wyoming Parks -0.7563518 -4.0
Connecticut Transit -0.7513477 -50.0
Massachusetts Workplace -0.7512392 -39.0
Arizona Residential 0.7278778 13.0
Alaska Residential 0.6930065 13.0
Nevada Transit -0.6709874 -20.0
Alaska Workplace -0.6665641 -33.0
New York Workplace -0.6538019 -34.5
Vermont Grocery_Pharmacy -0.6513546 -25.0
North Dakota Parks 0.6374063 -34.0
Rhode Island Retail_Recreation -0.6311475 -45.0
Idaho Residential -0.6259130 11.0
Utah Parks -0.6070075 17.0
New Jersey Parks -0.6023186 -6.0
Rhode Island Residential -0.5983355 18.5
Maine Workplace -0.5905117 -30.0
New York Retail_Recreation -0.5901671 -46.0
Nebraska Workplace 0.5816746 -32.0
Utah Workplace -0.5395724 -37.0
Arizona Retail_Recreation -0.5326195 -42.5
New Jersey Workplace -0.5299701 -44.0
New York Parks 0.5298217 20.0
Arkansas Parks 0.5292631 -12.0
Connecticut Retail_Recreation -0.5197930 -45.0
Connecticut Residential 0.5087907 14.0
Maine Parks 0.4956702 -31.0
Massachusetts Retail_Recreation -0.4951668 -44.0
New Jersey Grocery_Pharmacy -0.4874953 2.5
New Mexico Grocery_Pharmacy -0.4839658 -11.0
Connecticut Workplace -0.4731373 -39.0
Arizona Transit 0.4661582 -38.0
Nebraska Residential -0.4653667 14.0
New Mexico Residential 0.4570996 13.5
Maryland Workplace -0.4561877 -35.0
Montana Parks -0.4535515 -58.0
Iowa Parks -0.4489113 28.5
Rhode Island Parks 0.4481230 52.0
West Virginia Parks 0.4461255 -33.0
North Dakota Retail_Recreation -0.4334704 -42.0
Illinois Transit -0.4267512 -31.0
New Jersey Retail_Recreation -0.4188775 -62.5
Pennsylvania Workplace -0.4180195 -36.0
Kansas Parks 0.4127334 72.0
Maryland Grocery_Pharmacy -0.4125230 -10.0
New Jersey Transit -0.4105453 -50.5
Hawaii Residential -0.4099228 19.0
Montana Residential 0.4085711 14.0
Massachusetts Grocery_Pharmacy -0.4071330 -7.0
Kentucky Parks -0.4005782 28.5
Vermont Residential 0.4003312 11.5
New Mexico Parks 0.3991179 -31.5
Pennsylvania Retail_Recreation -0.3934503 -45.0
New Hampshire Residential -0.3912355 14.0
South Carolina Workplace 0.3881755 -30.0
Oregon Parks -0.3844636 16.5
Alabama Workplace -0.3837270 -29.0
Michigan Parks 0.3757190 28.5
New Mexico Retail_Recreation -0.3714644 -42.5
Alabama Grocery_Pharmacy -0.3670667 -2.0
New York Transit -0.3666371 -48.0
Missouri Residential -0.3607560 13.0
Nebraska Grocery_Pharmacy 0.3419537 -0.5
North Dakota Workplace 0.3380017 -40.0
Wisconsin Transit -0.3329210 -23.5
South Carolina Parks -0.3310559 -23.0
Arkansas Retail_Recreation -0.3249239 -30.0
Maryland Retail_Recreation -0.3246891 -39.0
Virginia Transit -0.3211048 -32.5
Alaska Retail_Recreation 0.3199889 -39.0
Montana Transit 0.3199344 -41.0
Idaho Workplace -0.3171832 -29.0
Maine Retail_Recreation -0.3102651 -42.0
Idaho Grocery_Pharmacy -0.3057142 -5.5
Nevada Residential 0.3006302 17.0
California Residential 0.2992010 14.0
Colorado Residential 0.2982781 14.0
Florida Residential 0.2970318 14.0
California Transit -0.2951189 -42.0
California Parks -0.2947664 -38.5
Illinois Workplace -0.2922256 -30.5
Minnesota Transit -0.2843422 -28.5
Idaho Transit -0.2824205 -30.0
Pennsylvania Parks 0.2789666 12.0
Vermont Retail_Recreation 0.2762444 -57.0
Wyoming Grocery_Pharmacy -0.2760831 -10.0
North Carolina Workplace 0.2678974 -31.0
North Carolina Grocery_Pharmacy 0.2675395 0.0
Pennsylvania Grocery_Pharmacy -0.2651265 -6.0
Vermont Workplace -0.2644230 -43.0
Oregon Grocery_Pharmacy -0.2639225 -7.0
Maryland Residential 0.2622933 15.0
Kansas Workplace 0.2611130 -32.5
Alabama Transit -0.2587619 -36.5
Georgia Grocery_Pharmacy -0.2581461 -10.0
West Virginia Grocery_Pharmacy -0.2579266 -6.0
Rhode Island Grocery_Pharmacy 0.2579082 -7.5
Mississippi Residential 0.2511330 13.0
Illinois Parks 0.2509280 26.5
Tennessee Workplace -0.2488339 -31.0
Rhode Island Transit -0.2479675 -56.0
Texas Workplace 0.2424410 -32.0
Tennessee Residential 0.2416965 11.5
Wyoming Workplace -0.2395925 -31.0
Washington Grocery_Pharmacy 0.2394885 -7.0
Florida Parks -0.2386897 -43.0
Wisconsin Parks 0.2338838 51.5
Georgia Workplace -0.2295514 -33.5
Nevada Retail_Recreation -0.2289952 -43.0
Missouri Workplace 0.2283492 -29.0
New York Grocery_Pharmacy -0.2281747 8.0
Minnesota Parks 0.2280731 -9.0
Nevada Workplace -0.2274478 -40.0
Tennessee Parks -0.2180671 10.5
Georgia Retail_Recreation -0.2178686 -41.0
Arizona Parks -0.2161897 -44.5
Indiana Parks -0.2134691 29.0
South Dakota Workplace 0.2132298 -35.0
Texas Residential -0.2128364 15.0
West Virginia Workplace 0.2100889 -33.0
North Carolina Transit 0.2027812 -32.0
North Dakota Grocery_Pharmacy -0.2025164 -8.0
Nebraska Transit -0.2015676 -9.0
Connecticut Parks 0.2004977 43.0
Oregon Residential -0.2001144 10.5
Kansas Grocery_Pharmacy -0.2000549 -14.0
Mississippi Grocery_Pharmacy -0.1934143 -8.0
Illinois Residential 0.1923709 14.0
Alabama Parks 0.1904919 -1.0
Colorado Parks -0.1890174 2.0
Virginia Residential 0.1808773 14.0
Wisconsin Residential -0.1803076 14.0
Wyoming Transit -0.1791527 -17.0
North Carolina Residential 0.1773579 13.0
New Hampshire Retail_Recreation -0.1713674 -41.0
Kentucky Transit -0.1705900 -31.0
Pennsylvania Transit -0.1699170 -42.0
Missouri Parks 0.1678708 0.0
Kentucky Grocery_Pharmacy 0.1676493 4.0
Texas Parks 0.1668031 -42.0
Ohio Transit 0.1650446 -28.0
Indiana Residential 0.1638952 12.0
Massachusetts Parks 0.1626536 39.0
Idaho Retail_Recreation -0.1613320 -39.5
South Dakota Residential 0.1613167 15.0
Utah Retail_Recreation -0.1608488 -40.0
Oklahoma Parks 0.1555022 -18.5
New Jersey Residential 0.1551174 18.0
New Mexico Transit 0.1511588 -38.5
North Dakota Residential -0.1418458 17.0
Kentucky Residential 0.1400809 12.0
Idaho Parks 0.1386210 -22.0
Minnesota Retail_Recreation 0.1377553 -40.5
Texas Grocery_Pharmacy 0.1340332 -14.0
Virginia Grocery_Pharmacy -0.1329374 -8.0
Iowa Transit 0.1328430 -24.0
North Dakota Transit 0.1310635 -48.0
Michigan Workplace -0.1272400 -40.0
Indiana Retail_Recreation 0.1256063 -38.0
Mississippi Retail_Recreation -0.1250649 -40.0
Missouri Transit -0.1249292 -24.5
Utah Grocery_Pharmacy 0.1232037 -4.0
West Virginia Residential -0.1226764 11.0
North Carolina Retail_Recreation 0.1216175 -34.0
Wisconsin Grocery_Pharmacy 0.1211142 -1.0
Montana Retail_Recreation 0.1208268 -50.0
California Grocery_Pharmacy -0.1168041 -11.5
Montana Workplace -0.1137233 -40.0
Oregon Transit 0.1109754 -27.5
Wisconsin Workplace -0.1108910 -31.0
Maryland Transit -0.1080809 -39.0
Florida Retail_Recreation 0.1073696 -43.0
Nebraska Retail_Recreation 0.1068213 -36.0
Alabama Retail_Recreation 0.1063298 -39.0
Iowa Workplace 0.1060785 -30.0
Mississippi Parks -0.1055566 -25.0
Oklahoma Grocery_Pharmacy -0.1051735 -0.5
Arkansas Workplace -0.1051511 -26.0
Wyoming Residential 0.1044984 12.5
New Hampshire Grocery_Pharmacy -0.1030442 -6.0
Kansas Transit -0.1016779 -26.5
Massachusetts Transit -0.1016515 -45.0
West Virginia Transit -0.1011672 -45.0
Virginia Parks 0.1007623 6.0
Florida Transit -0.1002560 -49.0
Massachusetts Residential 0.1000036 15.0
California Retail_Recreation -0.0996705 -44.0
Oregon Retail_Recreation 0.0992354 -40.5
Mississippi Transit -0.0963377 -38.5
Michigan Transit 0.0962120 -46.0
Minnesota Grocery_Pharmacy 0.0953297 -6.0
Indiana Workplace 0.0932014 -34.0
Iowa Grocery_Pharmacy -0.0919776 4.0
Arkansas Residential 0.0916552 12.0
New York Residential 0.0899155 17.5
Oklahoma Workplace 0.0894714 -31.0
Michigan Residential 0.0894061 15.0
Hawaii Workplace 0.0887792 -46.0
Ohio Residential 0.0865750 14.0
Nevada Parks 0.0842794 -12.5
Michigan Retail_Recreation -0.0837407 -53.0
Minnesota Residential -0.0809761 17.0
Virginia Retail_Recreation -0.0795080 -35.0
Alabama Residential -0.0779142 11.0
Pennsylvania Residential 0.0769491 15.0
Oregon Workplace -0.0765583 -31.0
Texas Transit 0.0746392 -41.0
Kentucky Retail_Recreation 0.0734918 -29.0
South Dakota Transit -0.0727265 -40.0
Ohio Grocery_Pharmacy 0.0710866 0.0
Texas Retail_Recreation 0.0709673 -40.0
Georgia Residential -0.0700149 13.0
Georgia Parks 0.0697419 -6.0
Virginia Workplace -0.0620290 -32.0
Ohio Parks -0.0600827 67.5
Washington Transit -0.0590016 -33.5
Maine Residential -0.0576593 11.0
Vermont Transit -0.0569120 -63.0
North Carolina Parks -0.0557371 7.0
South Dakota Retail_Recreation -0.0530484 -39.0
Washington Retail_Recreation 0.0505022 -42.0
South Carolina Transit -0.0495286 -45.0
Minnesota Workplace -0.0490731 -33.0
Georgia Transit -0.0484747 -35.0
Ohio Retail_Recreation 0.0476514 -36.0
Iowa Retail_Recreation 0.0472812 -38.0
Tennessee Transit 0.0456696 -32.0
Indiana Transit 0.0453054 -29.0
Kentucky Workplace -0.0449162 -36.5
Florida Grocery_Pharmacy 0.0434744 -14.0
New Hampshire Workplace 0.0434574 -37.0
Colorado Transit 0.0434181 -36.0
Arkansas Grocery_Pharmacy -0.0407579 3.0
Ohio Workplace -0.0382458 -35.0
Nebraska Parks 0.0379763 55.5
West Virginia Retail_Recreation -0.0375638 -38.5
New Hampshire Transit 0.0373608 -57.0
Illinois Retail_Recreation 0.0361844 -40.0
Arizona Workplace -0.0352592 -35.0
New Mexico Workplace 0.0341258 -34.0
California Workplace -0.0337300 -36.0
Wyoming Retail_Recreation 0.0328283 -39.0
Colorado Retail_Recreation -0.0325785 -44.0
Colorado Grocery_Pharmacy -0.0315579 -17.0
Illinois Grocery_Pharmacy -0.0298483 2.0
South Carolina Residential -0.0287718 12.0
Missouri Retail_Recreation -0.0269839 -36.0
Maine Grocery_Pharmacy -0.0260659 -13.0
Washington Parks 0.0253213 -3.5
Tennessee Retail_Recreation -0.0240760 -30.0
Wisconsin Retail_Recreation 0.0204950 -44.0
South Dakota Grocery_Pharmacy -0.0192226 -9.0
Oklahoma Residential -0.0173644 15.0
Washington Workplace -0.0169928 -38.0
Kansas Residential -0.0157511 13.0
Maryland Parks 0.0152574 27.0
Mississippi Workplace -0.0152403 -33.0
Arkansas Transit -0.0147621 -27.0
Tennessee Grocery_Pharmacy 0.0128721 6.0
Indiana Grocery_Pharmacy -0.0122233 -5.5
Oklahoma Transit 0.0107176 -26.0
Colorado Workplace 0.0103507 -39.0
Florida Workplace 0.0100169 -33.0
Montana Grocery_Pharmacy 0.0092244 -16.5
Oklahoma Retail_Recreation -0.0084572 -31.0
Washington Residential 0.0083266 13.0
South Carolina Retail_Recreation 0.0082692 -35.0
South Carolina Grocery_Pharmacy 0.0072017 1.0
Michigan Grocery_Pharmacy -0.0071321 -11.0
Iowa Residential -0.0039947 13.0
Nevada Grocery_Pharmacy -0.0036007 -12.5
Kansas Retail_Recreation -0.0019531 -38.0
Missouri Grocery_Pharmacy -0.0018730 2.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
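The manifest-to-rows conversion can also be done without an apply() loop, since sprintf() vectorizes over data-frame columns. A minimal base-R sketch, using two illustrative manifest entries (the handles and repo root mirror those above):

```r
# Build " | "-separated table rows from a manifest data frame (base R only).
github_root <- "https://github.com/sbs87/coronavirus/blob/master/"
plot_handle <- c("Corona_Cases.world.long.plot",
                 "Corona_Cases.world.loglong.plot")
manifest <- data.frame(
  name = c("World total & death cases, longitudinal",
           "World log total & death cases, longitudinal"),
  plot_handle = plot_handle,
  link = paste0(github_root, "results/", plot_handle, ".png"),
  stringsAsFactors = FALSE
)
# sprintf() recycles over the columns, yielding one row string per manifest entry
rows <- sprintf("%s | %s | %s", manifest$name, manifest$plot_handle, manifest$link)
cat(rows, sep = "\n")
```

This keeps the column order explicit, whereas `apply(..., MARGIN = 1, ...)` first coerces the data frame to a character matrix.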

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still log-linear). The mortality rate (deaths / confirmed RNA-based cases) is >1%, with a range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

Question | Answer
What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California shows one of the better correlations, but the picture is mixed.
What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are roughly log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall.
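The mortality figure quoted here is the simple ratio of cumulative deaths to cumulative confirmed cases. A minimal base-R sketch of that calculation, using made-up placeholder counts (not the real data):

```r
# Mortality rate = cumulative deaths / cumulative confirmed cases, per region.
# The counts below are illustrative placeholders, not actual case data.
cases <- data.frame(
  region    = c("NY", "NJ", "MD"),
  confirmed = c(100000, 50000, 20000),
  deaths    = c(5000, 2000, 500)
)
cases$mortality_rate <- cases$deaths / cases$confirmed
print(cases)
```

Because the denominator is confirmed (tested) cases only, this rate is sensitive to testing volume, which is one reason it fluctuates by region and over time.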

END

End: ##—— Sat Jun 20 15:15:40 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
install.packages("maps")
library(maps)
pa_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(pa_counties)

ggplot(pa_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
pa_counties$cases <- 1:2226  # placeholder case counts, one per boundary point
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"  # variable name carried over from an example; only the U.S. is mapped
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf (st_cast expects an sf/sfc input)

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net